Abstract

This paper proposes a family of multilevel double hashing schemes called cascade hash
tables. A cascade hash table consists of several levels of hash tables, each of which
uses the common double hashing scheme. Later levels act as fail-safes for earlier
levels, which effectively reduces collisions during insertion. The scheme thus achieves
a constant worst-case lookup time at a relatively high load factor (70% to 85%) in
experiments on random data. Several parameter settings of cascade hash tables are
tested.

1 Introduction 

The hash table is a common data structure for storing and retrieving large sets of data.
Its lookup time is $O(1)$ on average, but the worst-case lookup time can be as bad as
$O(N)$, where $N$ is the size of the hash table. This variation is essentially caused by
the possibly many collisions that occur when keys are hashed. In this paper, we present a
set of hash table schemes called cascade hash tables, which consist of several levels
(1 to 12) of hash tables of different sizes. If an item cannot find a free "cell" (slot)
in the first-level table after a constant number of probes, it tries to find a cell in
the second level, and then in subsequent levels. With this simple strategy, successive
levels have decreasing load factors and therefore decreasing collision probabilities, so
the probability that an item cannot find an empty cell in any level is small. This
enables the whole hash table to reach a high load factor with a constant number of probes
on randomly generated test sets before a crisis happens, that is, before an arriving item
cannot find a free slot in any level within the probe limit.



2 Common hash table schemes [4]



A hash table is a data structure that associates keys with values. The primary
operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the
corresponding value (e.g. that person's telephone number). It works by using a hash
function to transform the key into a hash, a number that the hash table uses to locate the
desired value. A hash function is a many-to-one mapping from keys in a large domain to
hashes in a relatively small range, so collisions among keys that map to the same hash
are inevitable. Hash table schemes differ in their hash function and their collision
resolution strategy.

2.1 Hash function 

Generally, a string key is hashed into an integer by a hash function and then mapped
to an index smaller than the table size (a common method is to compute the hash value
modulo the table size). There are various hash functions on strings, such as CRC,
lookup2 and MD5. Integer keys are mapped to indices directly.
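
As an illustration of the mapping just described, here is a minimal sketch in Python. It
assumes CRC-32 (via the standard zlib module) as the string hash and reduction modulo the
table size for both key types; the function names are hypothetical and are not taken from
the paper's implementation.

```python
import zlib


def string_to_index(key: str, table_size: int) -> int:
    """Hash a string key with CRC-32, then reduce it modulo the table size."""
    h = zlib.crc32(key.encode("utf-8"))  # unsigned 32-bit integer in Python 3
    return h % table_size


def int_to_index(key: int, table_size: int) -> int:
    """Map an integer key to an index directly (assumed here: key mod size)."""
    return key % table_size
```

Both functions always return an index in the range 0 to table_size - 1, and two different
keys may return the same index, which is exactly the collision case discussed next.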

2.2 Collision resolution 

If two keys hash to the same index, the corresponding records cannot be stored in
the same location. If the location is already occupied, we must find another location for
the new record, and do so in a way that lets us find the record when we look it up later.

The most popular collision resolution techniques are chaining and open addressing. 

In a chained hash table, each slot in the array references a linked list of the
inserted records that collide in that slot. Insertion requires finding the correct slot
and appending to either end of its list; deletion requires searching the list and
removing the record. This technique is intuitive, and its performance degrades gracefully
as the load factor increases. However, if the records are small, the overhead of the
linked list is significant. Additionally, traversing a linked list has poor cache
performance. Some popular hash table implementations, such as those in the STL, use this
technique.
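
A minimal sketch of chaining, under stated assumptions: Python lists stand in for the
linked lists, the built-in hash() stands in for the hash function, and the class name
ChainedHashTable is hypothetical.

```python
class ChainedHashTable:
    """Chaining sketch: every slot holds a list of (key, value) records."""

    def __init__(self, num_slots: int = 8):
        # One chain per slot; a Python list stands in for a linked list.
        self.slots = [[] for _ in range(num_slots)]

    def _chain(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:              # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))    # otherwise append to the end of the chain

    def lookup(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        return None                   # key is not in the table

    def delete(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:              # search the chain, then remove the record
                del chain[i]
                return True
        return False
```

With 8 slots, the integer keys 1 and 9 land in the same slot and share one chain, so a
lookup of either key walks that chain.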

Figure 1: A failed insertion into a 3-level cascade hash table (1-12 are the probe
sequence; cells a, ..., l are all occupied)

Open addressing hash tables store the colliding records directly within the array. A
hash collision is resolved by probing through alternate locations in the array (the probe
sequence) until either the target record is found or an unused array slot is found, which
indicates that there is no such key in the table. Well-known probe sequences include
linear probing, in which the interval between probes is fixed, often at 1; quadratic
probing, in which the interval between probes increases linearly (hence, the indices are
described by a quadratic function); and double hashing, in which the interval between
probes is fixed for each record but is computed by another hash function. Any of these
methods may probe an unbounded number of locations, as many as $N$ in the worst case. We
call such methods unlimited. A method is limited if the number of probes cannot exceed
some fixed limit.
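
The three probe sequences, and the difference between an unlimited and a limited method,
can be sketched as Python generators. This is an illustrative sketch only: the function
names are hypothetical, the quadratic variant uses triangular-number offsets as one
common choice, and the double hashing step is forced into the range 1 to n - 1 so that
it is never zero (which assumes n > 1).

```python
from itertools import islice


def linear_probes(h, n, step=1):
    """Unlimited linear probing: a fixed interval (often 1) between probes."""
    i = 0
    while True:
        yield (h + i * step) % n
        i += 1


def quadratic_probes(h, n):
    """Unlimited quadratic probing: the interval grows by 1 at each probe."""
    i = 0
    while True:
        yield (h + i * (i + 1) // 2) % n  # offsets 0, 1, 3, 6, ...
        i += 1


def double_hash_probes(h1, h2, n):
    """Unlimited double hashing: the interval comes from a second hash value."""
    step = 1 + h2 % (n - 1)               # fixed per record, never zero
    i = 0
    while True:
        yield (h1 + i * step) % n
        i += 1


def limited(probes, limit):
    """Turn any probe sequence into a limited one by cutting it off."""
    return list(islice(probes, limit))
```

For example, limited(double_hash_probes(5, 7, 11), 4) examines at most four cells of an
11-cell table, whatever the state of the table.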

3 Implementation 

An $M$-level cascade hash table consists of $M$ hash tables, each of which uses
limited double hashing. Each table is half the size of the preceding one (the ratio 1/2
was chosen empirically). We limit the total number of probes to 12, so the number of
probes in each level is $p = 12/M$. Here $M$ is a factor of 12, so
$M \in \{1, 2, 3, 4, 6, 12\}$. If an item cannot find a free cell in Level 1 within $p$
probes, it probes Level 2, and if that also fails, it moves on to the subsequent levels.
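
The parameter choices above amount to a small piece of arithmetic, sketched here in
Python. The helper names are hypothetical, and level_sizes halves the size exactly,
whereas the level sizes reported in the experiments are only approximately half of the
preceding level.

```python
from typing import List

TOTAL_PROBES = 12  # total probe budget across all levels


def probes_per_level(m: int) -> int:
    """Return p = 12 / M; M must be a factor of 12."""
    if m < 1 or TOTAL_PROBES % m != 0:
        raise ValueError("M must be one of 1, 2, 3, 4, 6, 12")
    return TOTAL_PROBES // m


def level_sizes(first_size: int, m: int) -> List[int]:
    """Sizes of the M levels, each half of the preceding one (exact halving assumed)."""
    return [max(1, first_size >> i) for i in range(m)]
```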

If a crisis happens, the hash table is enlarged and rehashed.

The lookup procedure is similar to the insertion procedure and also takes no more
than 12 probes.

Since insertion and lookup both take at most 12 probes, the time complexity of a
cascade hash table operation is $O(1)$.
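
The insertion and lookup procedures can be sketched as follows. This is a simplified
illustration, not the implementation used in the experiments: it assumes Python's
built-in hash() for both hash values, does not support deletion, and raises a
hypothetical Crisis exception instead of enlarging and rehashing the table.

```python
class Crisis(Exception):
    """Raised when no free cell is found within the total probe budget."""


class CascadeHashTable:
    def __init__(self, level_sizes, total_probes=12):
        if total_probes % len(level_sizes) != 0:
            raise ValueError("the number of levels must divide the probe budget")
        self.probes_per_level = total_probes // len(level_sizes)
        self.levels = [[None] * size for size in level_sizes]

    def _probe_cells(self, key):
        """Yield (table, index) for each cell of the key's probe sequence."""
        h = hash(key)
        for table in self.levels:          # Level 1 first, then later levels
            n = len(table)
            start = h % n
            # Limited double hashing: the step is derived from the same hash
            # value (an assumption of this sketch) and is never zero if n > 1.
            step = 1 + (h // n) % (n - 1) if n > 1 else 0
            for i in range(self.probes_per_level):
                yield table, (start + i * step) % n

    def insert(self, key, value):
        for table, idx in self._probe_cells(key):
            cell = table[idx]
            if cell is None or cell[0] == key:
                table[idx] = (key, value)
                return
        raise Crisis(key)                  # caller would enlarge and rehash

    def lookup(self, key):
        for table, idx in self._probe_cells(key):
            cell = table[idx]
            if cell is None:               # the first empty cell ends the search
                return None
            if cell[0] == key:
                return cell[1]
        return None
```

Because an item is always stored in the first free cell of its probe sequence and nothing
is ever deleted, a lookup can stop at the first empty cell. Both operations examine at
most total_probes cells.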

In particular, when $M = 1$, the scheme is the ordinary (limited) double hashing scheme.
When $M = 12$ (one probe per level), it is the "multilevel adaptive hashing" scheme
presented in [1]. Experiments show that this is not the best configuration.
4 Experiment results



Experiments on random data sets give the results shown in Tables 1, 2 and 3.



 M    N          n*         L         n_1/N_1, ..., n_M/N_M
 1    1572869    580218     36.89%    580218/1572869
 3    1376287    1065756    77.44%    754587/786433, 290176/393241, 20993/196613
 4    1474604    1209981    82.05%    (illegible)/786433, (illegible)/393241,
                                      104524/196613, 4278/98317
 6    1548354    1356218    87.59%    737498/786433, 360648/393241, 170138/196613,
                                      70123/98317, 17085/49157, 726/24593
12    1572574    1237520    78.69%    623705/786433, 310791/393241, 154514/196613,
                                      76749/98317, 37674/49157, 18418/24593,
                                      8885/12289, 4080/6151, 1819/3079,
                                      658/1543, 197/769, 30/389

* n: the number of items in the hash table when a "crisis" happens

Table 1: Experiment result 1



 M    N          n*         L         n_1/N_1, ..., n_M/N_M
 1    6291469    2134465    33.93%    2134465/6291469
 3    5505041    4221564    76.69%    3011794/3145739, 1134328/1572869, 75442/786433
 4    5898282    4925580    83.51%    3026162/3145739, 1409371/1572869,
                                      462594/786433, 27453/393241
 6    6193212    5428347    87.65%    2951068/3145739, 1443047/1572869, 680299/786433,
                                      281403/393241, 69400/196613, 3130/98317
12    6290024    4929305    78.37%    2490323/3145739, 1239424/1572869, 615131/786433,
                                      304273/393241, 149215/196613, 72349/98317,
                                      34201/49157, 15488/24593, 6382/12289,
                                      2097/6151, 398/3079, 24/1543

Table 2: Experiment result 2

 M    N           n*          L         n_1/N_1, ..., n_M/N_M
 1    12582917    3686536     29.30%    3686536/12582917
 3    11010077    8374882     76.07%    6011507/6291469, 2229292/3145739, 134083/1572869
 4    11796510    9768044     82.80%    6042092/6291469, 2799205/3145739,
                                        881930/1572869, 44817/786433
 6    12386364    10698813    86.38%    5882616/6291469, 2863408/3145739, 1329154/1572869,
                                        518523/786433, 102712/393241, 2400/196613
12    12579950    9872865     78.48%    4982132/6291469, 2480713/3145739, 1232466/1572869,
                                        610866/786433, 300339/393241, 146020/196613,
                                        69541/98317, 31758/49157, 13286/24593,
                                        4562/12289, 1095/6151, 87/3079

Table 3: Experiment result 3

From these tables, we can see that the load factor drops drastically at some level of a
multilevel hash table. Take the three-level hash table as an example: when the hash
tables are "full" (not really full, but the incoming item cannot be inserted within the
probe limit), $n_1/N_1$ is around 0.95, $n_2/N_2$ is between 0.7 and 0.72, but $n_3/N_3$
is only around 0.1. So when a new item arrives, the crisis rate (the probability that the
item cannot find an empty cell in any level, with $p = 4$ probes per level) is no greater
than $0.95^4 \times 0.72^4 \times 0.1^4 \approx 0.000022$.

Now consider a one-level hash table with a load factor of 76%. It would need about
$\log_{0.76} 0.000022 \approx 39$ probes to obtain the same small crisis rate (which
ensures a high load factor) as the 3-level cascade hash table, whereas the cascade hash
table makes only 12 probes.
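
The two estimates above can be checked numerically. The sketch below treats every probe
as hitting an occupied cell independently with probability equal to the level's load
factor, which is the simplifying assumption behind the bound; the function names are
hypothetical.

```python
import math


def crisis_bound(level_loads, probes_per_level):
    """Probability bound that every probe in every level hits an occupied cell."""
    bound = 1.0
    for load in level_loads:
        bound *= load ** probes_per_level
    return bound


def equivalent_probes(load, target):
    """Real-valued k with load**k == target, for a single-level table."""
    return math.log(target) / math.log(load)


# Three levels with loads 0.95, 0.72 and 0.1, and p = 12 / 3 = 4 probes per level.
bound = crisis_bound([0.95, 0.72, 0.1], 4)
# Probes a one-level table at 76% load would need to reach the same bound.
k = equivalent_probes(0.76, bound)
```

Here bound is on the order of 2 x 10^-5, and k is far larger than the 12 probes spent
by the cascade hash table.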

To our surprise, the space efficiency of the $M$-level hash table does not increase
monotonically with $M$. It peaks at $M = 6$ and then falls at $M = 12$.
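
This observation can be reproduced from the $N$ and $n$ columns of Table 2 with a few
lines of Python. The dictionary below copies those two columns; the variable names are
hypothetical.

```python
# M -> (N, n) copied from Table 2: total size and item count at the crisis.
table2 = {
    1: (6291469, 2134465),
    3: (5505041, 4221564),
    4: (5898282, 4925580),
    6: (6193212, 5428347),
    12: (6290024, 4929305),
}

# Overall load factor L = n / N for each number of levels M.
load_factor = {m: n / size for m, (size, n) in table2.items()}

# The number of levels with the highest overall load factor.
best_m = max(load_factor, key=load_factor.get)
```

The load factor rises from M = 1 through M = 6 and then drops at M = 12, so best_m is 6.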

5 Conclusion 

In this paper, we introduce a series of hash table schemes called cascade hash tables.
A cascade hash table uses $M$ levels of hash tables, and in every level it uses limited
double hashing to make probes. Smaller hash tables work as fail-safes for bigger hash
tables. Roughly speaking, the different tables are like sieves with holes of different
shapes, and we hope that no object escapes through all of them: with more sieves, the
chance that an object is caught at some level is higher. The idea is simple, but its
performance exceeds that of the ordinary one-level hash table dramatically when there are
more than 3 levels. With $M = 6$, it is much better than the hash scheme proposed in [1].

Obviously, if we permit a larger total probe count, we can achieve a higher load factor,
but the average speed will be lower. A user can therefore choose the configuration
that best balances speed and space efficiency for the application at hand.